Add capability boundary replay artifact #152
Code Review
This pull request introduces a system for generating and testing deterministic capability-boundary replay artifacts. It includes a generation script that extracts boundary graphs from JSON payloads, a sample artifact file, and a comprehensive test suite. Review feedback focuses on optimizing memory efficiency by refactoring the graph extraction logic to use iterables and generator expressions, ensuring the script can handle large numbers of payloads without excessive memory consumption.
import sys
from collections import defaultdict
from pathlib import Path
from typing import Any

def _extract_boundary_graph(payloads: list[dict[str, Any]]) -> tuple[tuple[tuple[str, str], ...], tuple[str, ...]]:
    edges: list[tuple[str, str]] = []
    nodes: set[str] = set()

    for relation_key in SUPPORTED_RELATION_KEYS:
        for payload in payloads:
            for relation_value in _collect_relation_values(payload, relation_key):
                rel_edges, rel_nodes = _extract_relation_data(relation_value, relation_key)
                edges.extend(rel_edges)
                nodes.update(rel_nodes)
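The current implementation of _extract_boundary_graph iterates over the entire payloads list once for each key in SUPPORTED_RELATION_KEYS. Swapping the loops processes each payload fully in a single pass, which lets the function accept an Iterable instead of a list. That is required for memory-efficient processing of large numbers of payloads with generators, because a generator can only be consumed once. Two caveats: the suggestion needs Iterable to be imported (for example from collections.abc), and it changes the order in which edges are appended, from grouped by relation key to grouped by payload.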
- def _extract_boundary_graph(payloads: list[dict[str, Any]]) -> tuple[tuple[tuple[str, str], ...], tuple[str, ...]]:
-     edges: list[tuple[str, str]] = []
-     nodes: set[str] = set()
-     for relation_key in SUPPORTED_RELATION_KEYS:
-         for payload in payloads:
-             for relation_value in _collect_relation_values(payload, relation_key):
-                 rel_edges, rel_nodes = _extract_relation_data(relation_value, relation_key)
-                 edges.extend(rel_edges)
-                 nodes.update(rel_nodes)
+ def _extract_boundary_graph(payloads: Iterable[dict[str, Any]]) -> tuple[tuple[tuple[str, str], ...], tuple[str, ...]]:
+     edges: list[tuple[str, str]] = []
+     nodes: set[str] = set()
+     for payload in payloads:
+         for relation_key in SUPPORTED_RELATION_KEYS:
+             for relation_value in _collect_relation_values(payload, relation_key):
+                 rel_edges, rel_nodes = _extract_relation_data(relation_value, relation_key)
+                 edges.extend(rel_edges)
+                 nodes.update(rel_nodes)
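The reason the loop order matters for generators can be shown with a small, self-contained sketch. Everything in it is a hypothetical stand-in: RELATION_KEYS, the payload shape, and both helper functions are illustrative only and are not the PR's _collect_relation_values or _extract_relation_data. With the relation key in the outer loop, a generator is exhausted after the first key and later keys silently see no payloads; with the payload in the outer loop, each payload is visited exactly once.

```python
from collections.abc import Iterable, Iterator
from typing import Any

# Hypothetical relation keys; the real SUPPORTED_RELATION_KEYS may differ.
RELATION_KEYS = ("grants", "delegates")


def edges_key_outer(payloads: Iterable[dict[str, Any]]) -> list[tuple[str, ...]]:
    """Original loop order: iterates over `payloads` once per relation key."""
    edges: list[tuple[str, ...]] = []
    for key in RELATION_KEYS:
        for payload in payloads:
            edges.extend(tuple(pair) for pair in payload.get(key, []))
    return edges


def edges_payload_outer(payloads: Iterable[dict[str, Any]]) -> list[tuple[str, ...]]:
    """Suggested loop order: visits each payload exactly once."""
    edges: list[tuple[str, ...]] = []
    for payload in payloads:
        for key in RELATION_KEYS:
            edges.extend(tuple(pair) for pair in payload.get(key, []))
    return edges


def make_payloads() -> Iterator[dict[str, Any]]:
    """Yields illustrative payloads lazily, like a generator expression would."""
    yield {"grants": [["a", "b"]], "delegates": [["b", "c"]]}
    yield {"grants": [["c", "d"]], "delegates": [["d", "e"]]}


# A generator is consumed by the first relation key, so "delegates" edges are lost.
key_outer_from_generator = edges_key_outer(make_payloads())
# The swapped loops work with a generator.
payload_outer_from_generator = edges_payload_outer(make_payloads())
# With a list, both orders find the same edges, but in a different sequence.
key_outer_from_list = edges_key_outer(list(make_payloads()))
```

In this sketch the two loop orders return the same set of edges from a list, but in a different sequence. Whether that matters in the PR depends on whether the edges are normalized or sorted downstream, which this excerpt does not show.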
original_payloads = [_load_json(path) for path in _discover_payload_files(fixture_root / "original")]
replay_payloads = [_load_json(path) for path in _discover_payload_files(fixture_root / "reconstructed")]

original_edges, original_nodes = _extract_boundary_graph(original_payloads)
replay_edges, replay_nodes = _extract_boundary_graph(replay_payloads)
Loading all JSON payloads into memory at once with list comprehensions becomes memory-intensive as the number and size of fixtures grow. Passing generator expressions to the updated _extract_boundary_graph (which accepts an Iterable) reduces the peak memory footprint: files are loaded and processed one at a time, so only the current payload plus the accumulated edges and nodes need to be held in memory.
- original_payloads = [_load_json(path) for path in _discover_payload_files(fixture_root / "original")]
- replay_payloads = [_load_json(path) for path in _discover_payload_files(fixture_root / "reconstructed")]
- original_edges, original_nodes = _extract_boundary_graph(original_payloads)
- replay_edges, replay_nodes = _extract_boundary_graph(replay_payloads)
+ original_edges, original_nodes = _extract_boundary_graph(
+     _load_json(path) for path in _discover_payload_files(fixture_root / "original")
+ )
+ replay_edges, replay_nodes = _extract_boundary_graph(
+     _load_json(path) for path in _discover_payload_files(fixture_root / "reconstructed")
+ )
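To make "one file at a time" concrete, here is a minimal sketch that records when each file is loaded relative to when it is processed. The helpers load_json, iter_payloads, and count_edges, the "edges" key, and the sorted glob are assumptions made for illustration; the PR's _load_json and _discover_payload_files are not shown in this excerpt and may behave differently.

```python
import json
import tempfile
from collections.abc import Iterable, Iterator
from pathlib import Path
from typing import Any

# Records the order of load and process steps so the interleaving is visible.
events: list[str] = []


def load_json(path: Path) -> Any:
    """Hypothetical loader; logs each load before parsing the file."""
    events.append(f"load {path.name}")
    return json.loads(path.read_text(encoding="utf-8"))


def iter_payloads(directory: Path) -> Iterator[Any]:
    """Yields payloads lazily; sorting keeps the file order deterministic."""
    for path in sorted(directory.glob("*.json")):
        yield load_json(path)


def count_edges(payloads: Iterable[dict[str, Any]]) -> int:
    """Consumes payloads one by one, standing in for graph extraction."""
    total = 0
    for payload in payloads:
        events.append("process")
        total += len(payload.get("edges", []))
    return total


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.json", "b.json"):
        (root / name).write_text(json.dumps({"edges": [["x", "y"]]}), encoding="utf-8")
    # Each file is loaded only when the consumer asks for the next payload.
    total = count_edges(iter_payloads(root))
```

After this runs, events alternates between a load and a process entry rather than listing every load first, which is the behavior the suggested change relies on.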
Motivation
Description
- scripts/generate_capability_boundary_replay_artifact.py that loads fixtures/manifest.json, reads original/*.json and reconstructed/*.json payloads, conservatively extracts only explicit structured capability-boundary data from supported keys, normalizes edges/nodes, and compares original vs reconstructed graphs using normalize_edges, nodes_from_edges, and compare_edges from the graph core.
- artifacts/capability_boundary_replay_results.json following the stable schema (artifact_id, generated_by, version, evaluation_mode, llm_judges, external_apis, families, global_summary) with deterministic ordering and no timestamps or environment fields.
- tests/test_capability_boundary_replay_artifact.py that assert artifact existence, exact regeneration parity, top-level schema stability, determinism/sanitization, manifest alignment (family/fixture counts and IDs), capability-boundary evidence/drift behavior (including zero-data handling), and label discipline using only registered failure labels.

Testing
- python scripts/generate_capability_boundary_replay_artifact.py which produces artifacts/capability_boundary_replay_results.json and matches the committed artifact exactly.
- pytest tests/test_capability_boundary_replay_artifact.py -q (all tests passed) and additionally ran pytest suites relied on (tests/test_graph_diff_artifact.py, tests/test_replay_graph_core.py, tests/test_fixture_manifest.py, tests/test_failure_taxonomy.py) as well as the full test run where 262 passed was observed under npm run check.

Codex Task
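The "deterministic ordering" and "exact regeneration parity" properties described above can be sketched as follows. Only the eight top-level key names come from the description; every value, the family_id field, and the build_artifact and render_artifact helpers are hypothetical and are not taken from the actual script.

```python
import json
from typing import Any

# Top-level keys named in the PR description.
TOP_LEVEL_KEYS = (
    "artifact_id",
    "generated_by",
    "version",
    "evaluation_mode",
    "llm_judges",
    "external_apis",
    "families",
    "global_summary",
)


def build_artifact(families: list[dict[str, Any]]) -> dict[str, Any]:
    """Builds an artifact with illustrative values and no timestamps."""
    # Sorting by a stable ID makes the output independent of input order.
    ordered = sorted(families, key=lambda family: family["family_id"])
    return {
        "artifact_id": "example_artifact",  # hypothetical value
        "generated_by": "example_generator",  # hypothetical value
        "version": 1,  # hypothetical value
        "evaluation_mode": "deterministic",  # hypothetical value
        "llm_judges": False,  # hypothetical value
        "external_apis": False,  # hypothetical value
        "families": ordered,
        "global_summary": {"family_count": len(ordered)},
    }


def render_artifact(artifact: dict[str, Any]) -> str:
    """Serializes with sorted keys and fixed indentation for byte-stable output."""
    return json.dumps(artifact, indent=2, sort_keys=True) + "\n"


families = [{"family_id": "b", "drift": False}, {"family_id": "a", "drift": True}]
first = render_artifact(build_artifact(families))
second = render_artifact(build_artifact(list(reversed(families))))
```

Because neither the data nor the serialization depends on input order, clock, or environment, first and second are identical strings, which is the kind of check a regeneration parity test can make against a committed file.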